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Abstract 

We consider the problem of estimating the unknown response function in the Gaussian white 
noise model. We first utilize the recently developed Bayesian maximum a posteriori "testimation" 
procedure of Abramovich et al. (2007) for recovering an unknown high-dimensional Gaussian mean 
vector. The existing results for its upper error bounds over various sparse / p -balls are extended to 
more general cases. We show that, for a properly chosen prior on the number of non-zero entries of 
the mean vector, the corresponding adaptive estimator is asymptotically minimax in a wide range of 
sparse and dense Z p -balls. 

The proposed procedure is then applied in a wavelet context to derive adaptive global and level- 
wise wavelet estimators of the unknown response function in the Gaussian white noise model. These 
estimators are then proven to be, respectively, asymptotically near-minimax and minimax in a wide 
range of Besov balls. These results are also extended to the estimation of derivatives of the response 
function. 

Simulated examples are conducted to illustrate the performance of the proposed level-wise wavelet 
estimator in finite sample situations, and to compare it with several existing counterparts. 
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1 Introduction 



We consider the problem of estimating the unknown response function in the Gaussian white noise 
model, where one observes Gaussian processes Y n (t) governed by 

dY n (t) = f(t)dt + -^-dW(t), 4 6 [0,1]. (1) 
V n 

The noise parameter a > is assumed to be known, W is a standard Wiener process, and / £ L 2 [0, 1] is 
the unknown response function. Under some smoothness constraints on /, such a model is asymptotically 
equivalent in Le Cam sense to the standard nonparametric regression setting (Brown & Low, 1996). 

In a consistent estimation theory, it is well-known that / should possess some smoothness properties. 
We assume that / belongs to a Besov ball Bt, q {M) of a radius M > 0, where < p, q < oo and 
s > max(0, 1/p— 1/2). The latter restriction ensures that the corresponding Besov spaces are embedded 
in L 2 [0, 1]. The parameter s measures the degree of smoothness while p and q specify the type of norm 
used to measure the smoothness. Besov classes contain various traditional smoothness spaces such as 
Holder and Sobolev spaces as special cases. However, they also include different types of spatially 
inhomogeneous functions (Meyer, 1992). 

The fact that wavelet series constitute unconditional bases for Besov spaces has caused various 
wavelet-based estimation procedures to be widely used for estimating the unknown response / £ 5* (M) 
in the Gaussian white noise model CO- The standard wavelet approach for the estimation of / is based 
on finding the empirical wavelet coefficients of the data and denoising them, usually by some type of 
a thresholding rule. Transforming them back to the function space then yields the resulting estimate. 
The main statistical challenge in such an approach is a proper choice of a thresholding rule. A series 
of various wavelet thresholds originated by different ideas has been proposed in the literature during the 
last decade, e.g., the universal threshold (Donoho & Johnstone, 1994a), Stein's unbiased risk estimation 
threshold (Donoho & Johnstone, 1995), the false discovery rate threshold (Abramovich & Benjamini, 
1996), the cross-validation threshold (Nason, 1996), the Bayes threshold (Abramovich et al., 1998) and 
the empirical Bayes threshold (Johnstone & Silverman, 2005). 

Abramovich & Benjamini (1996) demonstrated that thresholding can be viewed as a multiple hy- 
pothesis testing procedure, where one first simultaneously tests the wavelet coefficients of the unknown 
response function, for significance. The coefficients concluded to be significant are then estimated by the 
corresponding empirical wavelet coefficients of the data, while the non-significant ones are discarded. 
Such a "testimation" procedure evidently mimics a hard thresholding rule. Various choices for adjust- 
ment to multiplicity on the testing step lead to different thresholds. In particular, the universal threshold 
of Donoho & Johnstone (1994a) and the false discovery rate threshold of Abramovich & Benjamini 
(1996) fall within such a framework corresponding to Bonferroni and false discovery rate multiplicity 
corrections, respectively. 
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In this paper, we proceed along the lines of "testimation" approach, where we utilize the recently 
developed maximum a posteriori Bayesian multiple testing procedure of Abramovich & Angelini (2006). 
Their hierarchical prior model is based on imposing a prior distribution on the number of false null 
hypotheses. Abramovich et al. (2007) applied this approach to estimating a high-dimensional Gaussian 
mean vector and showed its minimax optimality where the unknown mean vector was assumed to be 
sparse. 

We first extend the results of Abramovich et al. (2007) to more general settings. Consider the problem 
of estimating an unknown high-dimensional Gaussian mean vector, where one observes yi governed by 

yi = m + a n Zi, i = l,2,...,n. (2) 

The variance a\ > 0, that may depend on n, is assumed to be known, z\ are independent iV(0, 1) random 
variables, and the unknown mean vector fj, = (fii, . . . , fj, n )' is assumed to lie in a strong Z p -ball Z p [ry n ], 
< p < oo, of a normalized radius r] n , that is, \ \n\\ p < C n , where C n = n 1 / p a n r] n . Abramovich et 
al. (2007) considered the Gaussian sequence model © with a\ = a 2 and derived upper error bounds 
for the quadratic risk of an adaptive Bayesian maximum a posteriori estimator of /x in the sparse case, 
where < p < 2 and r\ n — > as n — ► oo. We extend their results for all combinations of p and r\ n and 
for the variance in © that may depend on n. We show, in particular, that for a properly chosen prior 
distribution on the number of non-zero entries of fi, the corresponding estimator, up to a constant factor, 
is asymptotically minimax for almost all l p -balls including both sparse and dense cases. 

We then apply the proposed approach to the wavelet thresholding estimation in the Gaussian white 
noise model £T|). We show that, under mild conditions on the prior distribution on the number of non- 
zero wavelet coefficients, the resulting global wavelet estimator of /, up to a logarithmic factor, attains 
the minimax convergence rates simultaneously over the entire range of Besov balls. Furthermore, we 
demonstrate that estimating wavelet coefficients at each resolution level separately, allows one to remove 
the extra logarithmic factor. Moreover, the procedure can also be extended to the estimation of derivatives 
of /. These results, in some sense, complement the adaptively minimax empirical Bayes estimators of 
Johnstone & Silverman (2005). 

2 Estimation in the Gaussian sequence model 
2.1 Bayesian maximum a posteriori estimation procedure 

We start with reviewing the Bayesian maximum a posteriori estimation procedure for the Gaussian se- 
quence model © developed by Abramovich et al. (2007). 

For this model, consider the multiple hypothesis testing problem, where we wish to simultaneously 
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test 



Hoi : m = versus H u : m ^ 0, i 



= 1,2,..., 



n. 



A configuration of true and false null hypotheses is uniquely defined by the indicator vector x = 
(xi, x n )' , where Xi = 7^ 0) and 1(A) denotes the indicator function of the set A. Let k = 
x\ + ... + x n = be the number of non-zero fii, i.e., ||^||o = #{i '■ Hi 7^ 0}. Assume some 

prior distribution 7r n on k with ir n (K) > 0, k = 0, . . . , n. For a given k, all the corresponding different 
vectors x are assumed to be equally likely a priori, that is, conditionally on k, 



Naturally, \ii \ x i = ~ ^o> where 5o is a probability atom at zero. To complete the prior specification, 
we assume that \ii I x i = 1 ~ -W(0> T n)- 

For the proposed hierarchical prior, the posterior probability of a given vector x with k non-zero 
entries is 



and 7 n = r^/cr^ is the variance ratio (Abramovich & Angelini, 2006). 

Given the posterior distribution ir n (x,K \ y), we apply the maximum a posteriori rule to choose 
the most likely indicator vector. Generally, to find the posterior mode of 7r n (j;, k \ y), one should look 
through all 2 n possible sequences of zeroes and ones. However, for the proposed model, the number of 
candidates for a mode is, in fact, reduced to n + 1 only. Indeed, let x(k) be a maximizer of ([3]) for a fixed 
k that indicates the most plausible vector x with k non-zero entries. From ©, it follows immediately that 
Xi(n) = 1 at the k entries corresponding to the smallest Bayes factors Bi and zeroes otherwise. Due to 
the monotonicity of Bi in \y\i in Q, it is equivalent to setting Xi(n) = 1 for the k largest \y\i and zeroes 
for others. The proposed Bayesian multiple testing procedure then leads to finding k that maximizes 





(3) 



where the Bayes factor Bi of Hoi is 




(4) 



logvr n (£(K),K \y) =c+J2y(i) + 2a n( l + V7n)log 



1=1 



K 




vr n (K)(l+ 7 n)- K/2 



for some constant c or, equivalently, minimizes 
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where \y\(u > ■ ■ ■ > |y|(„)- The k null hypotheses corresponding to . . . , \y\(k) aTe rejected. The 

resulting Bayesian estimation yields a hard thresholding with a threshold Amap = |y|(«)> i- e -> 

( Vi, \yi\ > Amap, 
= < (5) 
[ 0, otherwise. 

If k = 0, then all y^, i = 1, 2, . . . , n, are thresholded and fx = 0. 

From a frequentist view, the above estimator fi = {fn, . . . , /i n )' in © is evidently a penalized 
likelihood estimator with the complexity penalty 

P n [K) = 2^(1 + l/ 7 n) bg { ("W(*)(l + 7n) K/2 } • (6) 

In this sense, it can be also considered within the framework of Birge & Massart (2001). We will discuss 
these relations in the following section in more details. 

2.2 Upper error bounds 

Abramovich et al. (2007, Theorem 6) obtained upper error bounds for the Z 2 -risk of (f5]) in the Gaussian 
sequence model (|2]) for sparse l p [r]n] -balls, where < p < 2 and r\ n — > as n — > oo. We extend now 
these results to more general settings. 

Fix a prior distribution tt ti (k) > 0, k = 0, . . . , n, on the number of non-zero entries of /u, and let 
7n = T n / °n b e the variance ratio. 

Proposition 1 Let fi be the estimator (0 of fi in the Gaussian sequence model (0), where ll 6 
< p < oo. Asswrne ?/iere exist positive constants 7_ 7+ smc/j rta? 7_ < 7 n < 7+. 

1. Le? < p < 00. Assume that vr n (n) > e~ c ° n for some cq > 0. Then, as n — > 00, 

sup ^(HA — Mill) = O(ncr^). 

A»£/p fan] 

2. Le? 2 < p < 00. Asswme that there exists (3 > smc/z f/zaf 7r n (0) > n~ Cl ™ /or some ci > 0. 
Then, as n — ► 00, 

sup - //|||) = 0{a 2 n nr] 2 n ) + 00 2 n _/3 log n). 

A»G£p fan] 

3. Le£ < p < 2. Asswrae 7r n (K) > (K/n) C2K for all k = 1,2, ... , t%n, where n _1 (2 log n) p / 2 < 
a n < exp{— c(7 n )}, c(7 n ) = 8(7„ + 3/4) 2 > 9/2, and for some C2 > 0. TTzerc, as n — » 00, 

sup £(||£ - Mill) = 0(^(2 log r^) 1 "^ 2 } 

AtSip fan] 

foralln- l {2\ogny/ 2 < rfc < a n . 
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4. Let < p < 2. Assume that there exists (5 > such that 7r n (0) > n Cin 13 for some c\ > 0. Then, 
as n — > oo, 

sup E(\\ft - Mill) = 0(o*n 2 /*rfc) + 0(a 2 n n^ log n) 
for all rf n < re" 1 (2 log nfl 2 . 

The proof of Proposition [J is given in the Appendix. Similar to Abramovich et al. (2007), analogous 
results can be obtained for other types of balls, e.g., weak l p -balls, < p < oo, and Zo-balls, with 
necessary changes in the proofs (Petsa, 2009, Chapter 3). 

Since the prior assumptions in Proposition Q] do not depend on the parameters p and r\ n of the / p -ball, 
the estimator ([5]> is inherently adaptive. The condition on 7r n (n) guarantees that its risk is always bounded 
by an order of na 2 , corresponding to the risk of the maximum likelihood estimator, flf ILE = y%, in the 
Gaussian sequence model ©. 

The following corollary of Proposition Q] essentially defines dense and sparse zones for 2 < p < oo, 
and dense, sparse and super-sparse zones for < p < 2 of different behavior for the quadratic risk 
of the proposed estimator (f5]). To evaluate its accuracy, we also compare the resulting risks with the 
corresponding minimax risks R(l p [r) n ]) = inf^ sup^gj E(\\p, — fx\\ 2 ) that can be found, e.g., in 
Donoho & Johnstone (1994b). In what follows g\(n) >c gi (n) denotes < liminf{g 1 (n)/g 2 (ra)} < 
limsup{gi(n)/<72(w)} < oo as n — > oo. 

Corollary 1 Let fi be the estimator (0 of \i in the Gaussian sequence model (O, where u £ l p [Tj n ], 
< p < oo. Assume that there exist positive constants 7_ and 7 + such that 7_ < j n < 7 + . Define 
c(7n) = 8(7n + 3/4) 2 > 9/2 and let the prior n n satisfy the following conditions: 

1. 7r n (0) > n~ Cin ~ P for some j3 > and c x > 0; 

2. 7t„(k) > (K/n) C2K for all k = 1,2, ... , an, where a = exp(— 9/2) or a = exp{— c(7_)} if 7_ is 
known, and for some C2 > 0; 

3. 7r„(n) > e~ c ° n for some c$ > 0. 

Then, as n — > oo, depending on p and rj n , one has: 
Case 1. Let < p < oo, ry^ > a. 77je«, 

sup E(\\fi- n\\ 2 2 ) = 0{n<j 2 n ), R{l p [r] n ]) na 2 n 

MG'p fan] 

Case 2. Le? 2 < p < oo, rf^ < a. Then, 

sup E (\\ fi - n\\l) = 0(a 2 l nnl)+0(a 2 l n- 13 log n), R{l p [r] n \) ^ a 2 n nn 2 n 

fl&p fan] 
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Case 3. Let 0<p< 2, n~ 1 (2 log nfl 2 <rf n <a. Then, 

sup E(\\fi - = 0{^n^(21og^) 1 ^/ 2 }, R(l p [r, n ]) x ^n<(21ogr/^) 1 -f/ 2 
Aieip [»7h] 

Care 4. Let < p < 2, rft < n _1 (2 log n) p/2 . TTien, 

sup J B(||A-^||^) = 0( ( T 2 n 2 /fr ? 2 ) + 0(a 2 n-' 3 logn), fl(Z p [ % ]) x a 2 n 2 /^ 2 

For P = one can easily verify that all three conditions of Corollary Q] are satisfied, for example, for the 
truncated geometric prior TrGeom(l — q), < q < 1, where 7r n (K) = (1 — q)q K /{l — q n+1 ), k = 
0, . . . , n. On the other hand, for any /?, no binomial prior Bin(n, p n ) can "kill three birds with one stone". 
The requirement 7r n (0) = (1 — p n ) n > n~ cin p necessarily implies p n — > as n — > oo. However, to 
satisfy vr n (n) = p™ > e _c ° n , one needs p n > e~ c °. 

The impact of Corollary [T]is that, up to a constant multiplier, the proposed estimator © is adaptively 
minimax for almost all l p -balls, < p < oo, except those with very small normalized radiuses, where 
r/ 2 = o(n -( ^ +2 / mm ( p ' 2 )) logn). Hence, while the optimality of most the existing threshold estimators, 
e.g., universal, Stein's unbiased risk, false discovey rate, has been established only over various sparse 
settings, the Bayesian estimator © is appropriate for both sparse and dense cases. To the best of our 
knowledge, such a wide adaptivity range can be compared only with the penalized likelihood estimators 
of Birge & Massart (2001) and the empirical Bayes threshold estimators of Johnstone & Silverman 
(2004b, 2005); see Petsa (2009, Chapter 3) for more details. 

In fact, as we have mentioned, there are interesting asymptotic relationships between the Bayesian 
estimator ((5]> and the penalized likelihood estimator of Birge & Massart (2001) that may explain their 
similar behavior. For estimating the normal mean vector in © within /p-balls, Birge & Massart (2001) 
considered a penalized likelihood estimator with a specific complexity penalty 

P n («) = Ca 2 n K{\ + y/{2L K )} 2 ) (7) 

where L K = log(n/«) + (1 + 0)(1 + log(n)//c) for fixed C > 1 and 9 > (Birge & Massart, 2001, 
Section 6.3). For large n and k < n/e, this penalty is approximately of the following form: 

P„(k) ~ 2<j 2 n CKL K ~ 2a 2 n c x jlog ("J + c 2 k\ (8) 

for some positive constants c, c\, C2 > 1; see also Lemma Q] in the Appendix. Thus, within this range, 
Pn in 0-® behaves in a way similar to a particular case of the penalty P n in © corresponding to the 
geometric type prior 7r n (K) oc (l/c2) K . This prior satisfies the second condition on 7r n of Corollary Q] 
Such a Bayesian interpretation can also be helpful in providing some intuition behind the penalty P n 
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motivated in Birge & Massart (2001) mostly due technical reasons. In addition, under the conditions of 
Corollary [T] P n (n) ~ P„(n) ~ cn. 

Furthermore, for sparse cases, where k <C n, under the conditions on the prior 7r n of CorollaryQ] both 
penalties P n and P n are of the same so-called 2k log(n/ft)-type penalties of the form 2cr^«;{log(n/«;) + 
Cn,n}, where C > 1 and c K . n is negligible relative to log(n/«). Such type of penalties has appeared within 
different frameworks in a series of recent works on estimation and model selection (Foster & Stine, 1999; 
George & Foster, 2000; Birge & Massart, 2001; Abramovich et al., 2006; Abramovich et al., 2007). 

3 Bayesian maximum a posteriori wavelet estimation in the Gaussian white 
noise model 

3.1 General algorithm 

In this section we apply the results of Section [2] on estimation in the Gaussian sequence model (O to 
wavelet estimation of the unknown response function / in the Gaussian white noise model (Q]). 

Given a compactly supported scaling function <fi of regularity r > s and the corresponding mother 
wavelet ifi, one can generate an orthonormal wavelet basis on the unit interval from a finite number Cj 
of scaling functions (j)j k at a primary resolution level jo and wavelets tpjk at resolution levels j > jo and 
scales k = 0, . . . , 2-? ' — 1 (Cohen et al., 1993; Johnstone & Silverman, 2004a). For clarity of exposition, 
we use the same notation for interior and edge wavelets, and in what follows denote 4>j k by ipj -i } k- 

Then, / is expanded in the orthonormal wavelet series on [0, 1] as 

oo 2*' -1 
j=j -l fc=0 

where 9jk = Jq f(t)ipjk(t)dt. In the wavelet domain, the Gaussian white noise model (Q]) becomes 

Yjk = 8jk + ejk, j > jo - 1, k = 0, . . . , 2 3 - 1, 

where the empirical wavelet coefficients are given by = f} t()jk(t)dY(t) and are independent 
iV(0, a 1 In) random variables. 

Define J = log 2 n. Estimate wavelet coefficients 9jk at different resolution levels j by the following 
scheme: 

1. set 9j -i,k = Yj -i,fc; 

2. apply the Bayesian estimation procedure of Abramovich et al. (2007) described in Section [2] to 
estimate Oj^ at resolution levels jo < j < J by the corresponding Oj^; 
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3. set § jk = 0, j > J. 
The resulting wavelet estimator f n of / is then defined as 

f n (t) = ij-o-i,fc^ -i,fc(*) + EE Mikity (9) 

fc=o i=io fc=o 

Theorem Q] below shows that, under mild conditions on the prior 7r n , the resulting global wavelet 
estimator (© of /, where the estimation procedure is applied to the entire set of wavelet coefficients at 
all resolution levels jo < 3 < J, up to a logarithmic factor, attains the minimax convergence rates over 
the whole range of Besov classes. Furthermore, Theorem [2] demonstrates that performing the estimation 
procedure at each resolution level separately allows one to remove the extra logarithmic factor. Moreover, 
a level-wise version of (© allows one to estimate the derivatives of / at optimal convergence rates as well. 

3.2 Global wavelet estimator 

The number of wavelet coefficients at all resolution levels up to J is n = 2 J — 2- 70 ~ n for large n. Let 
7i"n(^) > 0, k = 0, . . . , n, be a prior distribution on the number of non-zero wavelet coefficients of / at 
all resolution levels jo < j < J, and let the prior variance of non-zero coefficients at the jth resolution 
level be Tj/n; the corresponding level-wise variance ratios are jj = r? /a 2 . 

It is well-known (Donoho & Johnstone, 1998) that, as n — > oo, the minimax convergence rate for 
the L 2 -risk of estimating the unknown response function / in the model ([]]) over Besov balls Bp JM), 
where < p, q < oo, s > max(0, 1/p — 1/2) and M > 0, is given by 

inf sup E{\\f n -f\\l)^n- 2s l^ +l \ 
f n /efi|,,(M) 



Theorem 1 Let ip be a mother wavelet of regularity r and let f n be the corresponding global wavelet 
estimator ([9]) of f in the Gaussian white noise model 0, where f E Bp AM), < p, q < oo, 1/p < 
s < r and M > 0. Assume that there exist positive constants 7_ and 7+ such that 7_ < jj < 7+ for all 
j = ioj • • • > J — 1- Let the prior ir n satisfy 7r n (fc) > {n/n) CK for all k = 1,2,..., exp(— 9/2)n or, for a 
shorter range k = 1, 2, . . . , exp{— c(7_)}n ifj- is known. Then, as n — > 00, 

sup «(||/.-/|g)-o(f!SI2)*}. (10) 

The proof of Theorem[T]is based on the relationship between the smoothness conditions on functions 
within Besov spaces and the conditions on their wavelet coefficients. Namely, if / € Bp q (M), then 
the sequence of its wavelet coefficients {9jk, k = 0, . . . , 2 J ' — 1, j = jo, . . . , J — 1} belongs to 
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a weak ^2/(2s+i)"ball of a radius aM, where the constant a depends only on a chosen wavelet basis 
(Donoho, 1993, Lemma 2). One can then apply the corresponding results of Abramovich et al. (2007) 
for estimation over weak l p -balls. Details of the proof of Theorem[T]are given in the Appendix. 

The resulting global wavelet estimator does not rely on the knowledge of the parameters s, p, q and 
M of a specific Besov ball and it is, therefore, inherently adaptive. Theorem Q] establishes the upper 
bound for its L 2 -risk and shows that the resulting adaptive global wavelet estimator is asymptotically 
near-optimal within the entire range of Besov balls. In fact, the additional logarithmic factor in (ITOb is 
the unavoidable minimal price for adaptivity for any global wavelet threshold estimator (Donoho et al., 
1995; Cai, 1999), and in this sense, the upper bound for the convergence rates in (fTOl ) is sharp. To remove 
this logarithmic factor one should consider level-wise thresholding. 

3.3 Level-wise wavelet estimator 

Consider now the level-wise version of the wavelet estimator Q, where estimation is applied separately 
at each resolution level j. The number of wavelet coefficients at the jth resolution level is nj = 2 J . Let 
7Tj(k) > 0, k = 0, . . . , 2 J , be the prior distribution on the number of non-zero wavelet coefficients, and 
let Tj /n be their level- wise prior variance, jo < j < J; the corresponding level- wise variance ratios are 

Theorem 2 Let ip be a mother wavelet of regularity r and let f n (') be the corresponding level-wise 
wavelet estimator & of f in the Gaussian white noise model ([7]), where f E Bp AM), < p,q < oo, 
1/p < s < r and Al > 0. Assume that there exist positive constants 7_ and 7+ such that 7_ < jj < 7+ 
for all j = jo, . . . , J — 1. Let the priors ttj satisfy the following conditions for all j = jo, . . . , J — 1: 

1. nj(0) > 2~ Clj for some c\ > 0; 

2. 7Tj(«) > (k2~3) C2K for all k = 1,2,..., ctj2 J , where C2 > and < c a < atj < exp{— c(jj)} 
for some constant c a > 0, and the function c(jj) = 8(7^ + 3/4) 2 was defined in Proposition^ 

3. nj(2?) > e- co2J for some c > 0. 
Then, as n — > 00, 

sup E(\\f n -f\\l)=0(n-^T-A. 

feBi q (M) v ' 

For / G Bp q (M), the sequence of its wavelet coefficients at the jth resolution level belongs to l p [r]j], 
where r)j = C n 1 / 2 2-^ s+1 / 2 ) for some Co > (Meyer, 1992, Section 6.10). The conditions on the prior 
in Theorem|2]ensure that all the four statements of the Proposition Q] simultaneously hold at all resolution 
levels jo < j < J with (3 = 0, and one can exploit any of them at each resolution level. It is necessary 
for adaptivity of the resulting level-wise wavelet estimator (©. 
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As we have mentioned in Section 12.21 all three conditions of Theorem |2] hold, for example, for the 
truncated geometric prior TrGeom(l — qj), where qj are bounded away from zero and one. 

It turns out that requiring a slightly more stringent condition on tvj(0), allows one also to estimate 
derivatives of / by the corresponding derivatives of its level-wise wavelet estimator f n at the optimal 
convergence rates. Such a plug-in estimation of by fn is, in fact, along the lines of the vaguelette- 
wavelet decomposition approach of Abramovich & Silverman (1998). 

Recall that, as n — ► oo, the minimax convergence rate for the L 2 -risk of estimating an mth derivative 
of the unknown response function / in the model (Q]) over Besov balls BpJM), where < m < 
min{s, (s + 1/2 — l/p)p/2}, < p, q < oo and M > 0, is given by 

inf SUp E{\\}^ - f^\\l) X „-2(«-m)/(2«+l) 

fk m) feB' i9 (M) 

(Donoho et al., 1997; Johnstone and Silverman, 2005). 

The following Theorem [3] is a generalization of Theorem [2] for simultaneous level-wise wavelet 
estimation of a function and its derivatives. 

Theorem 3 Let tp be a mother wavelet of regularity r and let f n be the level-wise wavelet estimator 
d2]) of f in the Gaussian white noise model ([7]), where f G BpJM), < p,q < oo, 1/p < s < r 
and M > 0. Assume that there exist positive constants 7_ and 7+ such that 7_ < 7j < 7+ for all 
j = jo, . . . , J — 1. Let the priors ttj satisfy the following conditions for all j = jo, . . . , J — 1: 

1. TTj(0) > 2- cl i 2 ~ 0j for some (3 > and a > 0; 

2. 7Tj(«) > (k2~3) C2K for all k = 1,2, ... , afl 3 , where C2 > and < c a < ctj < exp{— 0(7^)} 
for some constant c a > 0, and the function c(7j) = 8(7^ + 3/4) 2 was defined in Proposition^ 

3. Wj(2?) > e- co2J for some c > 0. 

Then, for all mth derivatives /( m ) off, where < m < [3/2 and m < min{s, (s + 1/2 — l/p)p/2}, as 
n — > oo, 

■V \ i \ o / 2(s — m) 

sup E(\\f^)-f(™)\\t)=o(n-^W- 

Theorem |2]is evidently a particular case of Theorem [^corresponding to the case m = 0, for [3 = in 
the condition on 7Tj(0). Theorem [3] shows that the same proposed adaptive level- wise wavelet estimator 
(O is simultaneously optimal for estimating a function and an entire range of its derivatives. This range 
is the same as that for the empirical Bayes shrinkage and threshold estimators appearing in Theorem 1 
of Johnstone & Silverman (2005). The proof of Theorem[3]is given in the Appendix. 
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4 Numerical Study 



4.1 Preamble 

In this section, we present a simulation study to illustrate the performance of the developed level-wise 
wavelet estimator Q and compare it with three empirical Bayes wavelet estimators: the posterior mean 
and the posterior median of Johnstone & Silverman (2005), and the Bayes Factor of Pensky & Sapatinas 
(2007); and two other estimators: the block wavelet estimator NeighBlock of Cai & Silverman (2001) 
and the complex-valued wavelet hard thresholding estimator of Barber & Nason (2004). All the above 
Bayesian estimators and the block wavelet estimator are asymptotically minimax in a wide range of 
Besov balls. Although no such theoretical results have been established so far for the complex-valued 
wavelet estimator, it has performed well in simulations (Barber & Nason, 2004). 

In practice, one typically deals with discrete data of a sample size n and the sampled data analog of 
the Gaussian white noise model £T|) is the standard nonparametric regression model 

Yi = f(i/n) + ei, i = l,2, ...,n, 

where are independent N(0, a 2 ) random variables. The corresponding global and level-wise Bayesian 
maximum a posteriori wavelet estimation procedures then use the empirical wavelet coefficients obtained 
by the discrete wavelet transforms of the data. However, utilizing the machinery of Johnstone & Silver- 
man (2004a, 2005) for development of appropriate boundary-corrected wavelet bases, one can show that 
discretization does not affect the order of magnitude of the accuracy of the resulting wavelet estimates 
(Johnstone & Silverman, 2004a, 2005; Petsa, 2009, Chapter 3). 

The computational algorithms were performed using the WaveLab and EbayesThresh software. The 
entire study was carried out using the Matlab programming environment. 

4.2 Estimation of parameters 

To apply the proposed level- wise wavelet estimator Q one should specify the priors nj , the noise vari- 
ance a 2 and the prior variances t 2 or, equivalently, the variance ratios = t 2 / a 2 . We used the truncated 
geometric priors TrGeom(l — qj) discussed in Section l331 Since the parameters a 2 , qj and jj are rarely 
known a priori in practice, they should be estimated from the data in the spirit of empirical Bayes. 

The unknown a was robustly estimated by the median of the absolute deviation of the empirical 
wavelet coefficients at the finest resolution level J — 1, divided by 0.6745 as suggested by Donoho & 
Johnstone (1994a), and usually applied in practice. For a given a, we then estimate qj and jj by the 
conditional likelihood approach of Clyde & George (1999). 

Consider the prior model described in Section 12.11 The corresponding marginal likelihood of the 
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observed empirical wavelet coefficients, say Yjk, at the jth resolution level is then given by 



7j Efc a*fe*7 



2a 2 (l + 7i ) 



/2A 

L{q j , lj ;Y j )^Y J ^M)[ K ) ( 1 +1j)~ k/2 ex P 

where 7Tj(«) = (1 — qj)q^/{l — qf +1 ) and Xj are indicator vectors. Instead of direct maximization of 
L(qj,'jj;Yj) with respect to ^ and jj, regard the indicator vector x as a latent variable and consider the 
corresponding log-likelihood for the augmented data (Yj, x), i.e., 

K<H>1j\ Y 3i x ) =c + bg7rj(K)-log^^ - ^ log(l + 7j) + ^* 3 * , (11) 

where c is a constant. The EM-algorithm iteratively alternates between computation of the expectation of 
l(qj, jj) Yj,x) in (TTTT t with respect to the distribution of x given Yj evaluated using the current estimates 
for the parameters' values at the E-step, and updating then the parameters by maximizing it with respect 
to qj and jj at the M-step. However, for a general prior distribution ir n and for the truncated geometric 
prior, in particular, the EM-algorithm does not allow one to achieve analytic expressions on the E-step. 
Instead, we apply the conditional likelihood estimation approach originated by George & Foster (2000) 
and adapted to the wavelet estimation context by Clyde & George (1999). The approach is based on 
evaluating the augmented log-likelihood (TTTb at the mode for the indicator vector x at the E-step rather 
than using the mean as in the original EM-algorithm (Abramovich & Angelini, 2006). 

For a fixed number k of its non-zero entries, it is evident from (fTTT) that the most likely vector x(k) 
is Xi{n) = 1 for the k largest \Yj^\ and zero otherwise. For the given k, maximizing (fTTT ) with respect to 
7j after some algebra yields 7j(rc) = max jo, YHk=i Y (k)/( Ka ' 2 ) ~ T° simplify maximization with 
respect to qj, approximate the truncated geometric distribution ttj in (fTTT > by a non-truncated one. This 
approximation does not strongly affect the results, especially at sufficiently high resolution levels, and 
allows one to obtain analytic solutions for qj, i.e., qj(n) =«/(« + 1)- It is now straightforward to find k 
that maximizes (fTTT > together with the corresponding •Jj(k) and qj(k). The above conditional likelihood 
approach results therefore in rapidly computable estimates for and qj in closed forms. 

4.3 Simulation study 

We now present and discuss the results of the simulation study. For all three empirical Bayes wavelet esti- 
mators, we used the Double-exponential prior, where the corresponding prior parameters were estimated 
level-by-level by marginal likelihood maximization, as described in Johnstone & Silverman (2005). The 
prior parameters for the proposed level-wise wavelet estimator © were estimated by conditional like- 
lihood maximization procedure described in Section 14.21 above. For the block wavelet estimator, the 
lengths of the blocks and the thresholds were selected as suggested by Cai & Silverman (2001). Finally, 
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Figure 4. 1 : Wave (left) and Peak (right) test functions 

for all competing methods, a was estimated by the median of the absolute value of the empirical wavelet 
coefficients at the finest resolution level divided by 0.6745 as discussed in Section |4~2l 

In the simulation study, we evaluated the above six wavelet estimators for a series of test functions. 
We present here the results for the nowadays standard Bumps, Blocks, Doppler and Heavisine functions 
of Donoho & Johnstone (1994a), and Wave (Marron et al., 1998; Antoniadis et al., 2001) and Peak 
(Angelini et al., 2003) functions defined, respectively, as 

f(t) = 0.5 + 0.2 cos Ant + 0.1 cos 24vrt, < t < 1 

and 

/(i) = exp{-|£-0.5|}, 0<t<l. 

See Figure RTTI for Wave and Peak test functions. 

For each test function, M = 100 samples were generated by adding independent Gaussian noise 
e ~ N(0, (J 2 ) to n = 256, 512 and 1024 equally spaced points on [0,1]. The value of the root signal-to- 
noise ratio was taken to be 3, 5 and 7 corresponding respectively to high, moderate and low noise levels. 
The goodness-of-fit for an estimator / of / in a single replication was measured by its mean squared 
error. 

For brevity, we report the results only for n = 1024 using the compactly supported mother wavelet 
Coiflet 3 (Daubechies, 1992, p.258) and the Lawton mother wavelet (Lawton, 1993) for the complex- 
valued wavelet estimator. The primary resolution level was jo = 4. Different choices of sample sizes 
and wavelet functions basically yielded similar results in magnitude. 
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The sample distributions of mean squared errors over replications for different wavelet estimators 
in the conducted simulation study were typically asymmetrical and affected by outliers. Therefore, 
we preferred the sampled medians of mean squared errors rather than means to gauge the estimators' 
goodness-of-fit. Thus, for each wavelet estimator, test function and noise level, we calculated the sample 
median of mean squared errors over all 100 replications. To quantify the comparison between the com- 
peting wavelet estimators over various test functions and noise levels, for each model we found the best 
wavelet estimator among the six, i.e., the one achieving the minimum median mean squared error. We 
then evaluated the relative median mean squared error of each estimator defined as the ratio between the 
minimum and the estimator's median mean squared errors; see Table 14. ll 

As expected, Table 14. ll shows that there is no uniformly best wavelet estimator. Each one has its own 
favorite and challenging cases, and its relative performance strongly depends on a specific test function. 
Thus, the complex-valued estimator indeed demonstrates excellent results for Donoho & Johnstone's 
functions as it has been reported in Barber & Nason (2004), but is much less successful for Peak and 
Wave. The block estimator is the best for the Peak and Doppler but the worst for Blocks and Bumps. 
The proposed Bayesian estimator Q outperforms others for Wave but is less efficient for Donoho & 
Johnstone's (1994a) examples. Interestingly, the relative performance of the estimators is much less 
sensitive to the noise level. For each of the test functions, the corresponding best estimator is essentially 
the same for all noise levels. 

The minimal relative median of mean squared errors of an estimator over all cases reflects its inef- 
ficiency at the most challenging combination of a test function and noise level, and can be viewed as a 
natural measure of its robustness. In this sense, the posterior mean estimator is the most robust although 
it is not the winner in any particular case. 

As a "by-product", we also compared different thresholding estimators in terms of sparsity measured 
by the average percentage of non-zero wavelet coefficients which remained after thresholding; see Table 
14.21 The posterior mean estimator was not included in this comparison since it is a non-linear shrinkage 
but not a thresholding estimator. Similar to the previous results on the goodness-of-fit, the relative 
sparsity strongly depends on the test function. However, except for the Doppler example, the estimator 
© is consistently the most sparse among Bayesian estimators. 

Apart from providing a theoretical justification, the presented numerical results show that the pro- 
posed estimator demonstrates good performance in finite sample settings and can, therefore, be viewed 
as a contribution to the list of useful wavelet-based function estimation tools. 
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Table 4.1: Relative median mean squared errors (MSE) for various test functions, levels of the root 
signal-to-noise ratio (RSNR) and different wavelet estimators: the proposed Bayesian estimator (MAP), 
Bayes Factor (BF), posterior median (Postmed), posterior mean (Postmean), block (Block) and complex- 
valued hard thresholding (CW). 



signal 


RSNR 


MAP 


BF 


Postmed 


Postmean 


Block 


CW 


Peak 


3 


0.8697 


0.1763 


0.8279 


0.6589 


1 


0.5795 




5 


0.7772 


0.1497 


0.7864 


0.6525 


1 


0.6234 




7 


0.8033 


0.186 


0.8501 


0.6958 


1 


0.6979 


Wave 


3 


1 


0.5614 


0.9841 


0.9103 


0.4570 


0.9189 




5 


0.9841 


0.4603 


1 


0.9165 


0.6072 


0.8265 




7 


1 


0.6241 


0.9900 


0.9303 


0.7498 


0.7793 


Bumps 


3 


0.5968 


0.6254 


0.6814 


0.7569 


0.4769 


1 




5 


0.5221 


0.5641 


0.5893 


0.6671 


0.4788 


1 




7 


0.5132 


0.5537 


0.5707 


0.6420 


0.5202 


1 


Blocks 


3 


0.6595 


0.6807 


0.8815 


0.9500 


0.5606 


1 




5 


0.6875 


0.727 


0.8541 


0.9065 


0.4416 


1 




7 


0.6921 


0.7134 


0.7806 


0.8535 


0.4288 


1 


Doppler 


3 


0.7214 


0.611 


0.8277 


0.8709 


0.9878 


1 




5 


0.6962 


0.6739 


0.8116 


0.8583 


1 


0.9119 




7 


0.7655 


0.7122 


0.8236 


0.883 


1 


0.9382 


HeaviSine 


3 


0.7523 


0.3566 


0.9333 


0.9154 


0.8406 


1 




5 


0.6640 


0.3764 


0.8622 


0.8427 


0.5796 


1 




7 


0.6931 


0.3505 


0.8298 


0.8424 


0.5028 


1 
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Table 4.2: Average percentages of remaining coefficients for various test functions, levels of the root 
signal-to-noise ratio and different wavelet thresholding estimators. 



signal 


RSNR 


MAP 


BF 


Postmed 


Block 


cw 


Peak 


3 


7.57 


14.87 


8.92 


1.64 


1.74 




5 


5.62 


15.89 


8.39 


1.61 


1.93 




7 


4.26 


12.93 


7.81 


1.63 


2.05 


Wave 


3 


11.39 


18.81 


12.79 


5.21 


5.52 




5 


11.51 


20.19 


12.52 


6.30 


6.28 




7 


10.38 


20.84 


13.07 


6.30 


7.01 


Bumps 


3 


10.63 


12.13 


10.86 


16.63 


12.23 




5 


11.17 


12.60 


12.45 


21.13 


14.52 




7 


12.65 


13.90 


13.80 


23.70 


16.03 


Blocks 


3 


17.10 


15.32 


11.62 


12.27 


8.39 




5 


10.39 


12.20 


11.72 


18.03 


12.07 




7 


11.12 


12.87 


12.63 


22.47 


14.20 


Doppler 


3 


11.46 


14.73 


8.58 


5.69 


5.13 




5 


7.24 


9.23 


6.52 


6.63 


6.60 




7 


6.42 


9.58 


6.57 


7.27 


7.86 


HeaviSine 


3 


6.35 


19.27 


10.75 


1.98 


2.17 




5 


8.11 


19.73 


10.62 


3.17 


2.69 




7 


10.87 


18.52 


12.55 


4.00 


3.39 
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5 Appendix 

Throughout the proofs we use C to denote a generic positive constant, not necessarily the same each 
time it is used, even within a single equation. 

5.1 Proof of Proposition Q] 

We start the proof of the proposition with the following lemma that establishes the bounds for binomial 
coefficients: 

Lemma 1 For all n> 2 and k = 1, 2, . . . , n — 1, 

ny<(n\(ney_ (12) 



. K ) \kJ V K ) 

In particular, for k < n/e, 

'"]<[-) ■ (13) 

This lemma generalizes Lemma A.l of Abramovich et al. (2007), where the upper bound similar to 
that in (fl2l was obtained for k = o{n). 

Proof of Lemma [T| The obvious lower bound for the binomial coefficient in (fT2l has been shown 
in Lemma A.l of Abramovich et al. (2007). To prove the upper bound in (fT2l . note that using Stirling's 
formula one has 

n\ < ,ny ( ^_y" fey = ,ny 
kJ \eJ \n — k I \k/ Vk/ \n — Ki 



for all n > 2 and k = 1, 2, . . . , n — 1. 

Note that log{x/(2; — 1)} < l/(x — 1) for all x > 1. In particular, for x = n/n it implies 
log{n/(n — k)} < n/(n — k) and, therefore, (^r^) n_K < exp(ft) that together with ([141 completes the 
proof of (fT2l . 

The second statement (fT3T > of the lemma is an immediate consequence of (fT2l for k < n/e. This 
completes the proof of Lemma [T] 

We now return to the proof of Proposition Q] and consider separately all the four cases covered by 
the proposition. The proof will exploit the general results of Abramovich et al. (2007) on the upper 
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error bounds for the Z 2 -risk of the estimator (f5]) adapting them also for the case where the variance in the 
Gaussian sequence model ([2]) may depend on n. 

Case 1. Under the condition ir n > e~ c ° n , c > 0, the definition (O of jj, and ([6]) immediately imply 

\\y-v\\ 2 < I |l/-£| I 2 + Pn(k) <P„{n) = 0{nal). Thus, 

E{\\H - HI 2 ) < 2{E(\\y - A|| 2 ) + E(\\y - H| 2 )} = 0(na 2 n ). 
Case 2. Applying Corollary 1 of Abramovich et al. (2007) for k = yields 

n 

E(\\i^ - HID < C0(7n){ J> t ? + 2ff "(l + V7n) log^O)} + Ci( 7n ){l - 7r n (0)}o£, 

i=l 

where the exact expressions for co( 7n ) and ci( 7 „) are given in Theorem 2 of Birge & Massart (2001) 
with their K = 1 + l/(2 7n ); see the proof of Theorem 1 of Abramovich et al. (2007). In particular, 
under the assumptions of the proposition on the boundness of 7n , the functions co(7 n ) and ci(-y n ) are 
also bounded from above. For 2 < p < oo, the least favorable sequence fig that maximizes Y2i=l ^1 
over l p [rj n ] is /iqi = • • • = Hon = C n n~ l / p = rj n a n . As n — > oo, one then has 

- Mill) < co(7n) {a 2 77 2 n + 2a 2 (l + l/ 7n ) log tt" 1 ^)} + ci( 7n ){l - n n (0)}a 2 n 
= 0{alnr,l) + 0{oln-nogn). 

Case 3. This is essentially a sparse case considered in Abramovich et al. (2007) and its proof is a direct 
consequence of their Theorem 6. 

Case 4. The proof for this case is similar to that of Case 2 except that for < p < 2, the least favorable 
sequences /j,q that maximize J2i=i f 1 ! over I s IpiVn] are permutations of the spike (C n , 0, . . . , 0) and 
therefore Y^1=i Moi — a n n2 ^ Pr ln- Repeating the arguments used in the proof of Case 2 for k = 0, under 
the requirements of the proposition on boundedness of 7n , we then get as n — > oo, 

#(ll£ " Mill) < c ( 7 n) {^ 2/p r? 2 + 2(1 + l/ 7n )a 2 logTr-^O)} + ci( 7ft ){l - ^(0)}a 2 
= 0{<j 2 n n 2 /Vi 1 l) + 0{a 2 n n-nogn). 

for all rfc < n^ 1 (2 log nfl 2 . This completes the proof of TheoremQ] 
5.2 Proof of Theorem U 

Let i?j = Yl"k=o E{(8jk — 9jk) 2 }, j > jo — 1, be the L 2 -risk of the global wavelet estimator © at the 
jth resolution level. Due to the Parseval relation, E(\\f n — f\ | 2 ) = Y^,j>j -i Rj- Scaling coefficients are 
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not thresholded and therefore Rj -i = Cj a 2 n 1 = o(n 2s /( 2s + 1 )) as n — > oo. At very high resolution 
levels, where j > J, all wavelet coefficients 9jk are set to zero and, therefore, as n — > oo, 

oo oo 2 J — 1 

E^ = EE & = °(^ 2s ') = °(^ 2s/(2s+1) )> 

j=J j=J fc=0 

where s' = s + l/2 — 1/ min(p, 2) (Johnstone & Silverman, 2005). 

Consider now Y2j=j Rj- The set of wavelet coefficients of a function / £ B s p q (M) lies within 
a weak Z r -bail of a radius aM with r = 2/(2s + 1), where the constant a depends only on a chosen 
wavelet basis: m r [?7 n ] = {9 : \9\^ < (aM)r 1 / r } (Donoho, 1993, Lemma 2). The corresponding 
normalized radius r\ n = [a j \/n)~^n~ 1 l r aM = 0(n~ s ), where n = n — 2 J0 ~ n for large n. 

Under the conditions of the theorem, one can then apply Theorem 6 of Abramovich et al. (2007) for 
m r [r] n ] to get 

./-l f /, \ 2s/(2s+l) "I 

. — . r . i / log n x ' v 1 



fl,-< sup J E(||^-^||i) = o{r / ;(21ogr ? -) 1 - r / 2 }=o(f 

eem r [r; n ] k J [ V 



n 



as n — > oo. This completes the proof of Theorem Q] 
5.3 Proof of Theorem |3] 

Let -Rj = Ylt J =o E(9jk — 9jk) 2 , j > jo — 1, be now the L 2 -risk of the level-wise version of the 
wavelet estimator (© at the jth resolution level. Johnstone & Silverman (2005, Section 5.6) showed that 

m\A m) ~ f {m) \\ 2 )~Z 3 > 30 -i2 2m >Rr 

For any / £ q(M), the sequence of its wavelet coefficients at the jth resolution level belongs to 

a strong / p -ball of a normalized radius rjj = Con 1 / 2 2~^ s+l / 2 ^ for some Co > (Meyer, 1992, Section 

6.10). 

Define 

• 1 , ( nC o\ _J_, 

71 = loe 9 - ; ~ logo n. 

J 2s + 1 62 I c 2/p i 2s + 1 62 

For sufficiently large n, j\ > j . Note that rf- > c a for j < j\ and ^ < c Q for j > j'j with obvious 
modifications for p = oo. Consider the following cases: 

1. Scaling coefficients: j = jo — 1. Similarly to the global wavelet estimator, for a fixed primary 
resolution level j , 2 2m ^-^ R jo ^ = 0{n- 1 ) = ( n -2(a-m)/(2s+i)) as n _> oc _ 

2. Coarse resolution levels: jo < j < ji- Applying the first statement of Proposition Q] for each level one 
has 

J2 2 2mj Rj <CJ2 2 2mi «~Vn i < Cn- 1 jr 2( 2m+1 ^' = o(n-^ s - m ^ 2s+1 A 

3=30 3=30 3=30 
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as n — > oo. 

3. Middle and high resolution levels: j\ < j < J. Consider separately the cases (a) 2 < p < oo and (b) 

< p < 2. 

(a) 2 < p < oo. Under the conditions of the theorem, the second statement of Proposition Q] at the jth 
resolution level yields 

Rj < Cn~ x {njT]) + nj^lognj) < C{2- 2]S + n^T^ j) 

and, hence, as n — > oo, 

J-i 



Y, 2 2mj Rj < c(2- 2 ^ s - m ^ + n- 1 J 2 ) < c{n-^ s - m ^ 2s+ ^ + rT 1 log 2 n) 

= 0(V2(s-m)/(2s+l) N 



(b) < p < 2. Let j2 be the largest integer for which rf- > n- 1 (21ogrij) P/ ' 2 . One can easily verify that 

ji < 32 < J- 

Using the monotonicity arguments, rf- > nj 1 (2 log nj) p / 2 for all middle resolution levels j\ < 
j < 32- One can then apply the third statement of Proposition [Q and after some algebra, to get, for 
m < (a + 1/2 - l/p)p/2, 

j2 j2 ^ - l-p/2 



2 2m >R j <Cn- 1 ]T 2( 2m+1 ^n p / 2 2-^ s+1 / 2 ){log(n-P/ 2 2^ s+1 / 2 ))} 
3=ix+i j=ji+i 

< C' n -( 1 -P/ 2 )2^'iP( s + 1 /2-(2m+l)/p) log ^ n -p/2 2 j lP ( S +l/2)^ 
_ ^ n -2( S -m)/(2s+l)^ 

as n — > oo. 

At high resolution levels 22 < 3 < J,ffj < n ^ 1 {^ 1°§ n j ) P ^ 2 > an d tne fourth statement of Proposition 
Q] implies 

^ < C(2- 2 ^ s + 1 / 2 - 1 ^) + n- 1 2-^j). 

Hence, for < m < (3/2 and m < min{s, (s + 1/2 — l/p)p/2], one has 

J-i 

^ 2 2m ^- < C(2-2a2+l)(«+l/2-l/p-m ) + n -l j2^ =Si + 
J=J2 + 1 

where evidently S2 = 0(n -1 log! »i) = o(n~ 2 ( s ~ m )/( 2s+1 )) as n — > 00. From the definition of 
j 2 , 2(i2+l)(s+i/2-i/p) > y{nC/(j/ 2 + 1)} > V(nC/log 2 n), which after some algebra yields Si = 

( n -2( S -m)/(2 S +l)) as n ^ oo. 
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4. Very high resolution levels: j > J. Using the results of Johnstone & Silverman (2005), as n — > oo, 
the tailed sum 

J2 22mjR j = 0{n-^ s '-^) = (n- 2 ( s - m )/( 2s+1 )), 
j>J 

where s' = s + 1/2 — 1/ min(p, 2). Summarizing, 

^2 2 2mj Rj = 0( n - 2 ( s - m )/( 2s+1 )) 
i>jo-i 

as ra — > oo. This completes the proof of Theorem [3] 
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